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ABSTRACT 

Chromatin immunoprecipitation coupled with 
massive parallel sequencing (ChlP-seq) is increas- 
ingly used to map protein-chromatin interactions at 
global scale. The comparison of ChlP-seq profiles 
for RNA polymerase II (Polll) established in different 
biological contexts, such as specific developmental 
stages or specific time-points during cell differenti- 
ation, provides not only information about the 
presence/accumulation of Polll at transcription 
start sites (TSSs) but also about functional 
features of transcription, including Polll stalling, 
pausing and transcript elongation. However, anno- 
tation and normalization tools for comparative 
studies of multiple samples are currently missing. 
Here, we describe the R-package POLYPHEMUS, 
which integrates TSS annotation with Polll enrich- 
ment over TSSs and coding regions, and normalizes 
signal intensity profiles. Thereby POLYPHEMUS fa- 
cilitates to extract information about global Polll 
action to reveal changes in the functional state of 
genes. We validated POLYPHEMUS using a kinetic 
study on retinoic acid-induced differentiation and a 
publicly available data set from a comparative Polll 
ChlP-seq profiling in Caenorhabditis elegans. We 
demonstrate that POLYPHEMUS corrects the data 
sets by normalizing for technical variation between 
samples and reveal the potential of the algorithm in 
comparing multiple data sets to infer features of 
transcription regulation from dynamic Polll binding 
profiles. 



INTRODUCTION 

Based on a technological leap in the development of 
sequencing technologies, we are currently facing a switch 
from gene centric to global analyses. However, the con- 
comitant development of bioinformatics tools that allow 
for comparative functional analyses is lagging significantly 
behind, such that a lot of presently available data have not 
yet been exploited to gain maximal functional insight. 
Chromatin immunoprecipitation coupled with massive 
parallel sequencing (ChlP-seq) is one of these 
technologies, which is increasingly used to define biologic- 
ally relevant processes like (the dynamics of) chromatin 
modifications and constituents at genome-wide scale, the 
association patterns of (post-translationally altered) chro- 
matin modifiers or chromatin interacting proteins, such as 
transcription factors or RNA polymerases. Multiple com- 
putational approaches dedicated to ChlP-seq have been 
developed to (i) generate signal intensity profiles by 
cumulating sequenced reads aligned to the genome and 
(ii) identify significant 'peaks' over the reconstructed 
profile (1). 

For RNA polymerase II (Polll) the corresponding 
ChlP-seq profile is the composite of at least two function- 
ally different aspects; a strong and well-defined binding to 
a given transcription start site (TSS) where Polll is 
observed even in absence of transcriptional activity (2) 
and a second composite pattern resulting from Polll 
action subsequent to transcription initiation at a 
given TSS, which comprises several regulated events 
like transcript elongation, pausing and termination 
(Supplementary Figure SI). Therefore, any comparative 
study of transcriptional activities at distinct gene loci 
should consider the global behaviour of read-count 
signal intensities spread over the entire loci and not rely 
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only on the comparison of PolII occupancies at TSSs. 
Moreover, such a comparison should provide a 
genome-wide read-out that helps to gain insight into the 
regulated functional aspects of PolII subsequent to the 
initiation of transcription. Finally, variations inherent to 
the technology (i.e. sequencing depth) should be con- 
sidered in the comparison of different data sets by 
applying adequate normalization procedures as for the 
differential expression analysis of mRNA-Seq assays (3). 

Here, we describe POLYPHEMUS, an R package that 
integrates ChlP-seq derived PolII binding site information 
with gene annotation to identify coding regions associated 
with transcriptional activity and compare changes of these 
activities in biological contexts. For this, POLYPHEMUS 
performs gene length standardization and read-counts in- 
tensity of non-linear normalization of multiple data sets 
before comparison with correct technical/experimental 
variations, which could cause problems for comparative 
data analysis. We show here through the analysis of 
primary data and meta-analyses of published data sets 
that POLYPHEMUS can be used as an integral part of 
an analytical pipeline (Supplementary Figure S2) for the 
comparative analysis of multiple PolII ChlP-seq data sets 
to gain functional insight about the differential activity of 
gene networks activities in different biological contexts. 



MATERIALS AND METHODS 

Overview 

POLYPHEMUS is based on an integrative approach 
which combines, in a user-directed manner, information 
from several levels, as is schematically illustrated in 
Supplementary Figure S3, (i) As the first step 
POLYPHEMUS combines peak calling outputs from 
PolII ChlP-seq experiments with coding region database 
annotations; from this information, signal intensity 
profiles are extracted, which reveal those coding regions 
at which PolII was identified by ChlP-seq. (ii) The signal 
intensity profiles are scanned with a sliding window, thus 
producing smoothened sliding window intensity profiles, 
(hi) To compare two distinct PolII ChlP-seq experiments 
both their sliding window intensity profiles and (iv) the 
coding region lengths are normalized, (v) Finally, 
normalized profiles are displayed as a relative differential 
enrichment for PolII association. The procedures 
developed for peak calling/coding region data integration, 
normalization and standardization are described in detail 
below. 



Identification of RNA PolII-enriched coding regions 

As a first step, we identify chromatin sites to which PolII is 
bound. For this, we use MeDiChl, a model-based decon- 
volution approach originally developed by Reiss et al. (4) 
to study ChlP-ChIP profiles and which has been adapted 
for ChlP-seq data analysis (5). Whereas other peak caller 
outputs can be used together with POLYPHEMUS, 
MeDiChl provided an efficient manner to annotate sig- 
nificant PolII-enriched regions thanks to a peak-shape 
learning process that is performed before annotation of 
enriched-regions [illustrated in Supplementary Figure S4 
in comparison with the widely used peak caller 
'Model-based analysis of ChlP-Seq (MACS)' (6)]. 

Subsequent to PolII binding site identification 
POLYPHEMUS correlates peak positions with a coding 
region annotation database for the organism of interest, 
such as RefSeq (7). For this, the genomic locations of the 
identified PolII peaks are compared with annotated 
Transcription Start Sites (TSS) within a user defined 
window (default ±300 bp) around peak centres. The 
overlap identifies coding regions for which the ChlP-seq 
analysis displays significant enrichment of PolII at the 
TSSs. Together with the signal intensity wiggle files, this 
information is used to extract read-count intensities along 
the corresponding coding regions. To smoothen the PolII 
ChlP-seq profile over the gene bodies, a user-defined 
sliding window (default 250 bp) scans the concerned 
coding regions to compute a median sliding-window in- 
tensity (SWI). User-defined buffer regions (default 500 bp) 
upstream and downstream of the concerned genes are 
included in the analysis to include ChlP-seq-defined 
PolII binding that extends beyond annotated coding 
regions. Finally, the orientation of genes encoded by the 
negative strand is inversed to facilitate the comparative 
analyses in the subsequent steps. 

Normalization of RNA PolII profiles 

Before comparing the signal intensities within ChlP-seq 
data sets, it is essential to know if their global amplitudes 
are indeed comparable. Considering that the amplitude of 
ChlP-seq profiles is directly proportional to the total 
number of mappable reads (TMRs), previous studies 
have normalized different samples by linear correction 
with a scaling factor that adjusts for TMRs between 
samples (8-12) (Table 1), following the assumption that 
the differences in the TMRs uniformly affect the ampli- 
tude of the profile. To assess whether this assumption is 
valid, we displayed the SWI distribution pattern of 



Table 1. Normalization approaches used for the analysis of RNA Polymerase II ChlP-seq profiles 



Normalization method 


Approach 


Software 
implementation 


References 


Sequencing depth 


TMRs uniformly equalized relative to the sample with the lower 


No 


(9,10) 




number of reads 






Linear scaling 


The average reads count in a defined bin is divided by the TMRs 


No 


(7,8,11) 


LOWESS 


Locally weighted polynomial least square regression applied to 


No 


(13) 




estimate the mean and variance between the compared data sets 






Quantile/LOWESS 


Described in this study 


POLYPHEMUS 


This manuscript 






(R package) 





Page 3 of 1 1 



Nucleic Acids Research, 2012, Vol. 40, No. 4 e30 



compared profiles as minus versus average (MA) trans- 
formation plot, which is frequently used in microarray 
data analysis (13). Importantly, we observed that the dif- 
ferences in the TMRs can result in rather dramatic 
non-linear deviation of the compared SWIs (for 
example, see Figure IB top and bottom panels), indicating 
that a reliable comparison of ChlP-seq data sets with dif- 
ferent TMRs could require in certain cases more 
sophisticated procedures than linear scaling. 

This issue has been addressed recently by applying 
locally weighted polynomial least square regression 
(LOWESS) to estimate the smoother line of the mean 
and the variance of the observed data (14). 
POLYPHEMUS has LOWESS functionality integrated, 
but to include the possibility of comparing multiple 
profiles, we implemented, in addition, a quantile normal- 
ization option. The rationale for this is that while 
LOWESS and quantile normalizations produce similar 
results, there are two limitations when using LOWESS. 
First, span conditions to obtain the best smoothening 
(proportion of points used to compute) need to be empir- 
ically evaluated, thus making automation impossible 
and second, LOWESS requires high computation time, 
which is a serious disadvantage when dealing with next- 
generation sequencing data in a genome-wide context. 
Note that the implementation of LOWESS normalization 
in POLYPHEMUS follows a similar procedure as 
described (14). 

Quantile normalization. Quantile normalization relies on 
the assumption that the majority of coding regions present 
the same transcriptional activity across the compared 
experimental conditions, which reflects a common PolII 
association pattern to constitutively active genes. 
Correspondingly, the quantile normalization adjusts the 
distribution of SWIs for different samples to reach a 
common distribution pattern (15), by the following 
procedure: 

For N ChlP-seq data sets with n PolII-enriched coding 
regions, each of which comprising Z sliding windows: 

(i) build a matrix M of size Kx N, where each column 
is a ChlP-seq data set (AO and where each row cor- 
respond to the SWIs per coding region. Note that 
the total number of rows are defined by 

(ii) sort each column of M to give M sort 

(iii) take the means across each row of M sort and scale 
(i.e. divide) each SWIs with this value to get M' sort 

(iv) get the final matrix M norra by rearranging each 
colum of M' sort to have the same ordering than M 

In contrast to LOWESS, this approach can be extended 
to more than two samples, which, for instance, allows for 
the comparison of samples from kinetic analyses. The 
current POLYPHEMUS version handles multisample 
normalization. 

Comparing normalized ChlP-seq profiles. Subsequent to 
normalization, compared profiles are expressed by the 
ratios of their corresponding normalized SWIs. To 
correct for variations between contiguous SWI ratios, we 



fit the distribution of each coding region-specific ratios 
with a LOWESS-smoothened line. Each ratio is then 
interpolated to the fitted line, defined as fitted SWI ratios. 

Defining TSS/gene body regions. Different types of tran- 
scription regulation-relevant information can be extracted 
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Figure 1. Comparison of RNA PolII ChlP-seq profiles requires 
non-linear normalization. Meta analysis of C. elegans PolII profiling 
by ChlP-seq (10). (A) The signal tracks for chromosome V illustrate the 
different samples which are compared in this study. Display of two 
biological replicates (suffix 'repl' or 'rep2') of samples from embryo 
(E; blue tracing) or LI larval (LI; red and pink tracings) stages. For the 
biological replicate 2 of the LI sample two technical replicates 
(Llrep2A and Llrep2B) are displayed. TMR for these samples range 
from 2-8 million. Upregulated (red arrow), constitutive (black arrow) 
and downregulated (green arrow) PolII binding at given loci can be 
intuitively detected by visual inspection of the profiles, as different 
signal intensity scales are displayed (1:200 for the LI high TMR 
sample and 1:70 for samples with about 2 million TMR). (B) MA 
plots. The fitted LOWESS curve (red line) in the prenormalized MA 
plot for Erep2 versus Erepl (top left) reveals the need for nonlinear 
normalization before data comparison. LOWESS (top center) or 
quantile normalization (top right) was applied to enable data compari- 
son. MA plots for the technical replicate (Llrep2B versus Llrep2A) 
and a biological replicate (Llrep2A versus Llrepl) are also illustrated 
before or after LOWESS/quantile normalization. 
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from PolII association at TSS and gene body regions when 
studying gene regulation. Therefore, POLYPHEMUS has 
been designed such that binding at the TSS and the 
pattern observed along the coding region of a gene 
provide separate readouts. This approach facilitates 
gene classification according to local PolII activities, 
such as stalling or productive elongation (illustrated in 
Figures 3A and B and 5A and C). Inspired by previous 
studies describing the PolII enrichment around TSS sites 
of annotated genes (16-18), we have defined a 250-bp 
distance from the TSS as being indicative of relevant 
PolII-TSS binding (user-defined parameter in 
POLYPHEMUS). The remaining part of the coding 
region is defined as 'gene body'. 

Gene length standardization. To compensate for the highly 
variable length of gene coding regions POLYPHEMUS 
normalizes gene lengths. Consequently, an equal number 
of data points define all coding regions, which is an essen- 
tial prerequisite for comparative analyses. The procedure 
is performed as following: for a given gene body composed 
of Z sliding windows, where Z, corresponds to their pos- 
itions in the coding region of interest, the body-length 
standardization to a reference length L (in sliding 
window units) is performed by the transformation 

1 - L - 



where /j correspond to the standardized positions of the 
sliding windows in the gene-body. As each sliding window 
presents a given SWI, such information is represented in 
the context of their standardized positions (/j) and fitted to 
a LOWESS-smoothened line, which is then used to inter- 
polate the number of data points that will represent the 
body gene characteristics for PolII binding. 



Classification of PoIII-occupied coding regions 

POLYPHEMUS (i) combines identified PolII binding 
sites with signal intensity profiles, (ii) normalizes sample 
data sets for subsequent comparison and (iii) standardizes 
different coding regions such that comparative 
intracoding region and intersample analyses become 
possible. Given the existence of efficient tools for data 
clustering, we did not integrate such option into 
POLYPHEMUS. Rather, POLYPHEMUS generates a 
versatile matrix output (in addition to MA plots and in- 
tensity tables), in which columns correspond to 
normalized and standardized sliding window intensity 
(SSWI) ratios and lanes to the corresponding coding 
regions (Supplementary Figure S5). This matrix can be 
uploaded in tools like MultiExperiment Viewer (MeV) 
(19,20) to perform supervised or unsupervised gene clus- 
tering analysis based on normalized SWI ratios covering 
the corresponding coding regions, as exemplified below 
and depicted in Figures 2, 3 and 5. 
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Figure 2. Analysis of differential chromatin-association of RNA PolII at different developmental stages requires data normalization. Comparison of 
ClilP-seq data sets for larval and embryonic stages for (A) high TMR difference (Llrep2A versus Erepl: 8.5 versus 2.1 million reads) and (B) similar 
TMR (Llrepl versus Erepl: 2.0 versus 2.1 million reads). (Top panels) MA plots are shown before ('Prenormalization') and after normalization 
('quantile'). (Bottom panels) SOTA (max cycles = 9; cell variability, P = 0.01) to classify PolII associated genes according to the ratios of the signal 
intensities for each annotated gene in the two compared samples; shown is the SOTA before and after quantile normalization. The different 
SOTA-predicted classes are catalogued according to their relative PolII binding characteristics: constitutive (C: ~65%), downregulated (D: ~20%) 
and upregulated (U: ~15%). 
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Figure 3. Sub-classification of RNA PolII binding by TSS-gene body profiling. (A) Schematic representation of the conceptual PolII binding patterns 
in a comparative analysis (Ut-Cb upregulated signal intensity at TSS, constitutive gene body; Ct-Ub, constitutive TSS and increased signal intensity 
at gene body; U, global increase of the signal intensity; Dt-Cb, decrease in signal intensity at the TSS and constitutive pattern at gene body; Ct-Db, 
signal intensity constitutive at TSS and decreased at gene body; D, global decrease of signal intensity). (B) Illustration of the different PolII binding 
patterns from comparing Llrep2A with Erepl. CAST (similarity cutoff: 0.9) has been used to perform the initial classification, followed by an 
intuitive association of classes into those depicted in (a). (C) Frequency of the different transcriptional binding patterns indicated in (a) from 
comparing Llrepl or Llrep2A with Erepl. 



RESULTS 

Meta analysis of Caenorhabditis elegans RNA PolII 
chromatin association characteristics at different 
developmental stages 

To test and validate our approach, we have applied 
POLYPHEMUS to a publicly available C. elegans 
(GEO accession number: GSE 15628) data set that 
contains ChlP-seq profiles for PolII binding performed 
in the context of a study of PHA-4/FOXA transcription 
factor binding in embryonic and starved LI larval stages 
(12). We chose this data set, as a plethora of controls are 



provided for each PolII ChlP-seq sample. Specifically, the 
authors have generated two biological replicates for 
embryo and LI larval stages. In addition, two technical 
replicates are reported for one of the LI biological repli- 
cates. Hereafter, these different RNA PolII ChlP-seq 
samples are referred as Embryo-rep 1 (Erepl), 
Embryo-rep2 (Erep2), Larvae Ll-repl (Llrepl) and 
Ll-rep2A (Llrep2A) as well as Ll-rep2B (Llrep2B; 
Figure 1A). In addition, a signal intensity profile 
obtained by compiling the mappable reads in L2A and 
L2B (Llrep2A + B) is included for comparison 
(Figure 1A). 
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Non-parametric normalization methods implemented for 
proper comparison of RNA PolII ChlP-seq binding 
profiles 

As expected, the comparison of PolII signal intensity 
profiles obtained for larval LI and embryo stages reveals 
different patterns (Figure 1A). Indeed, gains and losses of 
PolII occupancy (red and green arrow, respectively) 
signify differential transcriptional activities at the 
associated coding region. Although visual inspection of 
the illustrated ChlP-seq profile supports this interpret- 
ation, a global comparison of all data points clearly 
reveals the need for data normalization. Indeed, the 
TMR differences between samples required a display of 
the wiggle intensity profiles at different scales to allow for 
a visual inspection (Figure 1A; compare Llrep2A: ~8 
million, scale 200, and Llrepl: ~2 million reads, scale 
70). Obviously, such an adjustment bears the risk that in 
a comparison of samples apparent differences for the 
observed peak intensities of a given PolII binding site 
are due to technical variability (TMRs differences) 
rather than the consequence of regulation. 

While such comparative analysis strongly suggest the 
necessity of data normalization, only an analytical 
approach, like the MA transformation of the profile- 
associated SWIs may reveal whether a linear correction 
or a more sophisticated approach suffices for correcting 
the described differences. As shown in Figure IB, even 
biological replicates with small TMR differences can 
exhibit an important offset behaviour in their MA plots 
towards high signal intensity values; this aberration in- 
creases with increasing TMR difference (Figure IB; 
compare top panel Erep2 versus Erepl with a difference 
of ~1 million reads and bottom panel Llrep2A versus 
Llrepl with a difference of ~6 million for the TMR). 
Note that the MA plots of the technical replicates 
Llrep2B and Llrep2A (ATMR: 26749 reads) reveal a 
well-centred pattern relative to the x-axis. The above 
analysis reveals that in this particular situation, a linear 
scaling approach is not suitable for normalization when 
comparing samples, even if TMR differences are as low as 
1 million. For this reason, we have implemented LOWESS 
and quantile normalization as user-defined options in 
POLYPHEMUS to generate cohorts with comparable 
data distribution. (Figure IB middle and right panels). 

Monitoring differential chromatin association of RNA 
PolII at different developmental stages 

The main interest in comparing signal intensity levels of 
PolII tracings is to infer differences in the transcriptional 
features of related biological materials. Zhong et al. (12) 
compared two different developmental stages in 
C. elegans, larval LI and embryo stages, to assess chro- 
matin localization of the transcription factor PHA-4/ 
FOXA and correlate its locus-specific binding with the 
regulation of gene expression. For this, they pooled the 
PolII ChlP-seq data sets from biological replicates, 
followed by linear TMR-based normalization of signal 
intensity profiles. To evaluate the need for linear or 
non-linear data normalization, when comparing such 
samples from different developmental stages, we have 



performed MA plots for high and low TMR differences 
(Figure 2; Llrep2A versus Erepl, ATMR ~6 million; 
Llrepl versus Erepl, ATMR ~34000). A major diver- 
gence of data scatters towards high signal intensities is 
obvious for the high ATMR samples (Figure 2A, top 
left) but even at low ATMR, the MA plots reveal a 
slight offset at both low and high signal intensities 
(Figure 2B, top left). Both these aberrations were efficient- 
ly corrected by quantile normalization (Figure 2A and B, 
top right). Given that Llrep2A and Llrepl are biological 
replicates, a similar differential PolII binding pattern at 
the TSS and along the gene bodies would be expected 
when comparing them separately to the embryo data. To 
validate this assumption, we applied the Self-Organizing 
Tree Algorithm [SOTA; Euclidean distance; max 
cycles = 9; cell variability P = 0.01 (21)] in MeV (20) to 
classify PolII binding patterns with or without prior nor- 
malization by quantile. Without normalization, the com- 
parison between Llrep2A and Erepl would indicate a 
globally increased PolII binding to the majority of loci 
(Figure 2A, bottom left); this is a consequence of the 
overall higher signal amplitudes of the Llrep2A data set 
(Figure 2A, top left). After quantile normalization, the 
SOTA analysis visualizes the differential PolII binding 
patterns in both cases. Indeed, around 2500 genes 
(~20% of the analysed genes) revealed a downregulation 
of PolII binding, whereas 2000 genes (~15% of the 
analysed genes) showed an increase when the embryo 
turns into a LI larvae. Most notably, comparing Erepl 
with either of the biological replicates Llrepl or Llrep2 
yielded similar results after quantile normalization. 
Importantly, linear correction in case of high TMR differ- 
ences (Llrep2A versus Erepl) produces aberrant readouts 
as illustrated in Supplementary Figure S6A. 

Classifying RNA PolII binding characteristics at 
coding regions 

The association of PolII with genes follows very complex 
patterns, which reveal aspects of its chromatin association 
and processivity; for instance, PolII may get bound to 
promoters and remain stalled, it may engage in transcrip- 
tion at low rate with high promoter occupancy or all 
promoter-bound PolII may transcribe the corresponding 
gene leaving an 'empty' promoter behind. Intermediates 
between these extremes are likely to exist, given that 
various effectors, such as PolII-modifying enzymes, 
factors involved in elongation or ncRNAs regulate 
PolII-mediated transcription (22). 

PolII ChlP-seq profiles are assembled snapshots from a 
large number of cells visualizing the regulated and 
dynamic chromatin interaction of enzymes; for each cell, 
they derive from one or more gene-specific PolII functions 
that comprise events like promoter loading/TSS occu- 
pancy, PolII stalling or travelling along the gene during 
active transcription or transcription termination. Specific 
features of a promoter/gene, such as bidirectionality, 
may have additional impact on PolII binding characteris- 
tics. Monitoring the patterns, dynamics and extent of 
PolII association and correlating these data with PolII 
function will reveal a readout of genome-wide PolII 
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transcriptional activity in a given experimental or cell 
physiological setting. 

Conceptually, the separate monitoring of PolII binding 
at the TSS and the gene body may display one of three 
ChlP-seq profiles for any given gene (Figure 3A): (i) TSS 
and coding region are deprived of RNA PolII, (ii) only the 
TSS is occupied and (iii) both TSS and gene body are 
occupied, or any intermediate thereof. Applied to a com- 
parative situation, like induced gene activation, patterning 
the ChlP-seq profiles will differentiate between (i) situ- 
ations where only the TSS shows higher occupancy in 
one of the data sets but no differences over the coding 
region ('Ut-Cb' for upregulated TSS-constitutive body); 
(ii) only the gene body that displays higher occupancy in 
one data set and no changes are seen at the TSS ('Ct-Ub' 
for constitutive TSS-upregulated body); (iii) both the TSS 
and the body show higher PolII occupancy in one data set 
('U' for upregulated); (iv) the TSS is deprived of PolII in 
one data set without any changes over the coding region 
('Dt-Cb' for downregulated TSS-constitutive body); (v) 
the body shows higher occupancy in one data set and no 
differential behaviour at the TSS ('Ct-Db' for constitutive 
TSS-cownregulated body); and finally (vi) both the TSS 
and the body are deprived of PolII in one data set ('D' for 
downregulated). 

While some of the above scenarios can be revealed by 
SOTA analysis (Figure 2A and B, bottom panels), a clas- 
sification approach that does not predefine the number of 
classes can lead to a more refined set of PolII association 
patterns. To this end, we used the Cluster Affinity Search 
Technique [CAST; Euclidean distance; threshold affinity 
value = 0.9 (23)] that is implemented in MeV (20). 
Figure 3B illustrates the CAST analysis for the compari- 
son between Llrep2A and Erepl after quantile normaliza- 
tion. This unsupervised clustering generated some 40 
classes that can be intuitively reorganized in the 
above-described conceptual patterns. The upregulated 
and downregulated PolII binding events previously 
revealed by SOTA analysis are now retrieved as additional 
subclasses (U: -9-13%; Ut-Cb: -2-3%; Ct-Ub: -2%; D: 
-8-9%; Dt-Cb: -0.5%; Ct-Db:~l% in Figure 3C). Note 
that applying CAST separately to the two biological rep- 
licates revealed essentially the same classification, as pre- 
viously demonstrated using SOTA. 

RNA PolII binding characteristics during F9 cell 
differentiation 

That POLYPHEMUS is designed to compare multiple 
data sets makes it the method of choice to analyse 
temporal PolII binding kinetics at a genome-wide level. 
To this aim, we used the well-characterized retinoid- 
induced F9 mouse embryonal carcinoma (EC) cell differ- 
entiation model [reviewed in ref. (24)]. Samples were col- 
lected during the first 48 h of all-trans retinoic acid 
(ATRA) treatment and processed for ChlP-seq analysis 
of PolII binding (Figure 4A; Supplementary File SI for 
details). Alignment of the sequenced reads against the 
mouse genome (mm9) yielded 4-6 million TMR for all 
samples (Figure 4B). As expected, MA plots for all time 
points relative to vehicle presented variable degrees of 



offset behaviour, revealing the need for normalization; 
the corresponding multicomparison quantile normaliza- 
tion is depicted in Figure 4C (bottom panel). 

We used supervised clustering to classify genes into dif- 
ferent patterns of relative PolII binding during cell differ- 
entiation. The corresponding SOTA [Euclidean distance; 
max cycles = 9; cell variability P = 0.01 (21)], reveals im- 
portant changes during differentiation (Figure 5A). The 
relative abundance of the various PolII chromatin 
binding classes changes rapidly during differentiation, 
revealing a highly dynamic recruitment/dissociation and/ 
or processivity of PolII (Figure 5B). Two hours after 
ATRA treatment approximately 900 genes show increased 
PolII binding (P-value confidence: 0.05; Supplementary 
Figure S7). At 6h, the proportion of genes with higher 
PolII binding pattern is subdivided in two groups, the 
upregulated group or 'U' characterized by significant 
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Figure 4. Analysis of the kinetics of RNA PolII binding during differ- 
entiation (A). Illustration of differentiation system. F9 teratocarcinoma 
cells were treated with ATRA for 48 h to induce primitive endodermal 
differentiation. Samples were collected at 0, 2, 6, 24 and 48 h and 
ChlPed with anti-PolII antibodies for ChlP-seq analyses. (B) Table 
illustrating the TMR per data set. (C) MA plots of data sets before 
and after quantile normalization. All samples were normalized relative 
to the Oh control sample. Note that, as in the case of the C. degans 
data sets (Figures 2 and 3) all data sets require non-linear normaliza- 
tion, as is obvious from the LOWESS fitted line (red) in the 
prenormalization MA plots. 



e30 Nucleic Acids Research, 2012, Vol. 40, No. 4 



Page 8 of 1 1 



PolII ratio levels at the TSS as well as the gene body (558 
genes) and that characterized by significant PolII ratio 
levels preferentially at the TSS (Ut-Cb; 529 genes). At 
24 h, the number of genes revealing upregulated RNA 
PolII binding decrease dramatically (approximately 200 
genes are classified as 'IT and approximately 700 genes 
as 'Ut-Cb,' respectively). However, this trend is reversed 
48 h after ATRA treatment. At this time, more than 2500 
genes display significant PolII levels (relative to 0 h) pref- 
erentially localized at the TSS (Ut-Cb). This biphasic PolII 
recruitment pattern correlates with two transcription 



peaks previously reported for F9 cell differentiation (25). 
Note that in contrast to PolII recruitment, the number of 
genes that lose PolII binding (relative to the 0 h) remained 
rather constant, suggesting that the majority of loci are 
'poised' for transcription activation by early or constitu- 
tive recruitment of PolII. 

Notably, SOTA classification not only identifies recruit- 
ment, loss and constitutive binding of PolII, but it shows 
binding patterns that diverge between TSS and the body 
of the coding regions, as illustrated for Nanog, Stra8 and 
Cdv3 (Figure 5C). The PolII ChlP-seq signal intensity 



A 2h vs 0h 



6h vs Oh 



# genes 



24h vs Oh 



# genes 



48h vs Oh 



# genes 




2h/0h 6h/0h 24h/0h 48h/0h 



Figure 5. Dynamic chromatin association patterns of RNA PolII during ATRA-induced differentiation. (A) POLYPHEMUS (QUANTILE)- 
normalized ATRA-treated data sets were classified relative to the nontreated control using SOTA (max cycles = 9; cell variability P = 0.01). The 
associated heat maps to the SOTA analyses illustrate the presence of upregulated (red), constitutive (black) and downregulated (green) PolII binding 
at identified loci based on at least 2a distance away from the global mean behaviour (Supplementary Figure S7). (B) The SOTA analysis of (A) was 
classified according to the occupancy of TSS, gene body or both and the number of coding regions per classes is depicted. The nomenclature is as in 
Figure 3. (C) Three examples illustrate the highly dynamic chromatin association of PolII during F9 cell differentiation. The signal intensity profiles 
(top panel) are compared with the normalized gene representation (middle panel). The corresponding qPCR validation performed at the TSS and at a 
defined region of the coding sequence is depicted in the bottom panel. The POLYPHEMUS representation (middle panel) displays the coding region 
in the X-axis (the TSS and the Body regions are delimited) and the fold induction of PolII binding levels at a given time point of ATRA treatment 
relative to the non-treated control. 
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profiles demonstrate that, whereas the presence of tran- 
scriptional activity on such regions is evident, the differ- 
ences between time points are rather difficult to assess 
(Figure 5C, top panels). With POLYPHEMUS, the 
relative differences between time points become readily 
apparent (Figure 5C, middle panels). Using quantitative 
real-time PCR (qPCR) of ChlPed PolII binding sites at 
the TSS or a position inside the corresponding coding 
region reveals a good correlation with the computational 
analysis by POLYPHEMUS (Figure 5C, bottom panels). 
For example, the PolII levels present at the coding region 
of Nanog appeared to be strongly induced in the first 2 h of 
ATRA treatment and steadily decreasing thereafter. This 
aspect was revealed both by ChlP-qPCR analysis and 
POLYPHEMUS, while it is not evident from the 
original signal intensity profiles. Similarly, ChlP-qPCR 
for StraS and Cdv3 correlated with POLYPHEMUS com- 
putation revealing a maximal recruitment of PolII to the 
TSSs at 24 h of ATRA treatment. Additional features of 
PolII association along gene loci are apparent from the 
analysis and warrant further attention for mechanistic 
analyses, such as a progressive increase (Nanog, 24 and 
48 h), decrease (Cdv3, 6 and 24 h) or U-shaped (Nanog, 
2h; Stra8, 2, 24 and 48 h in contrast to the 6h pattern) 
binding of PolII along the coding region. Such features are 
likely revealing functional aspects of (regulated) PolII 
binding to chromatin and/or features of (regulated) tran- 
scription in the context of chromatin structure and regu- 
latory machineries. 



DISCUSSION 

ChIP sequencing is the current method of choice to define 
and compare genome-wide chromatin binding patterns of 
regulatory factors, enzymes, non-coding RNAs and 
chromatin-modifying/transcriptional machineries, as well 
as posttranslational modifications of chromatin constitu- 
ents and coregulatory factors, irrespective of whether they 
bind directly or indirectly to chromatin. In the majority of 
cases, different data sets serve as a way to identify chro- 
matin regions occupied by several components, but the 
relative signal intensities associated to the chromatin 
regions of interest are in most of the cases neglected. As 
signal intensity profiles are generated from the number of 
reads that overlap in a given window, there is a direct 
correlation between the TMR and the amplitude of the 
signal intensity profile. Previous studies addressing the in- 
fluence of sequencing depth on the accuracy of binding site 
annotation showed that the number of identified sites 
increase with increasing TMR and provided evidence of 
factor-specific saturation levels. For PolII, saturation at 
promoter regions is attained beyond 3 million TMR, 
while it is not reached even at 20 million TMR for the 
transcription factor STAT1 (26). 

One of the most exciting features of ChlP-seq ana- 
lyses is the possibility to compare different conditions 
linked to a particular biological/mechanistic question, 
such as the dynamics of transcription factor binding 
to its cognate targets during a biological process (cell 
differentiation, oncogenic transformation, developmental 



processes, etc.). The problem in such a comparison is the 
variability of the technique itself. We demonstrate this 
issue in a meta-analysis of an extensive data set from 
C. elegans in which two different developmental stages 
were compared (12). While the technical replicates con- 
firmed the high reproducibility of the sequencing technol- 
ogy, a comparison between biological replicates revealed 
that even rather small TMR differences can lead to sig- 
nificant non-linear offset behaviour of the corresponding 
data sets in a comparative MA plot, emphasizing the need 
for data normalization before comparison. We show 
that linear normalization inadequately addresses the 
problem, while the non-parametric normalization proced- 
ure integrated in the POLYPHEMUS package reliably 
correct for the aberrant data scattering. Moreover, we 
demonstrate that the dynamic chromatin association of 
PolII during cell differentiation can be accurately moni- 
tored after data normalization with POLYPHEMUS. 

In addition to defining global patterns of binding sites, 
ChlP-seq profiles contain a wealth of additional informa- 
tion. In particular for PolII, the ChlP-seq signal intensity 
profiles originate from a plethora of regulatory inputs at 
the TSS and the travelling of the enzyme along the entire 
transcription unit. Indeed, both transcription initiation 
and transcript elongation are highly regulated events 
(22,27,28). At the TSS PolII recruitment, chromatin modi- 
fication and structure, preinitiation complex formation 
and a multitude of other regulatory events control phe- 
nomena like PolII binding, stalling or promoter escape, 
while events such as post-translational modification, asso- 
ciation of elongation factors and non-coding RNAs, 
regulate the travelling and transcript production. ChlP- 
seq profiles provide readouts for several of these phenom- 
ena by revealing among others aspects like promoter clear- 
ance, PolII pausing or dissociation. POLYPHEMUS 
facilitates such analyses, as it generates and visualizes 
normalized ChlP-seq signal intensity profiles at the TSS 
and along the gene body as exemplified in Figure 5C 
(middle panels). Indeed, we were surprised by the 
gene-specific variability of these profiles, which are likely 
to reflect the regulatory events affecting PolII-chromatin 
interaction in a dynamic, gene- and cell-specific manner. 
We are aware of the possibility that some functional 
aspects of PolII action may readout in the ChIP profiles 
through indirect effects; in this respect, PolII mobility 
(PolII stalling versus travelling, travelling speed, integra- 
tion in chromatin-associated complexes, etc.) may alter 
the efficiency by which it is crosslinked to a given locus. 

As for normalization of gene expression data obtained 
by microarray technologies, quantile normalization by 
POLYPHEMUS relies on the assumption of a common 
RNA PolII-binding distribution at the majority of genes 
investigated, which is maintained across the compared 
experimental conditions (29). Interestingly, a recent 
study applied quantile normalization for correcting differ- 
ences in TMR profiles associated with RNA-seq assays 
and demonstrated an improved decrease in the bias of 
monitoring differential gene expression relative to qRT- 
PCR 'gold standard' measurements when compared with 
other linear correction approaches (3). 
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With the rapid development of technologies that 
provide dramatically increasing sequencing depths, 
protein-chromatin interaction studies will evolve to inte- 
grate quantitative aspects of binding/enrichment along the 
genome. This study illustrates the necessary steps for such 
analyses in the case of RNA Polll. The general concepts 
underlying POLYPHEMUS can be extrapolated to other 
chromatin interactor studies, such as histone modification 
profiling. Furthermore, the POYPHEMUS pipeline can 
be combined with other computational efforts like those 
focused on promoter identification (18,30). Future 
versions may include additional statistical normalization 
methods that requiring less assumptions concerning data 
distribution, like ANOVA-type models (31), with the aim 
of expanding its use to other protein-chromatin 
interactors. 

AVAILABILITY 

POLYPHEMUS is currently available at http://igbmc.fr/ 
Gronemeyer_Polyphemus and on the CRAN network 
(http://cran.r-project.org/web/packages/polyphemus/). A 
Bioconductor-compliant package of POLYPHEMUS is 
being assembled and will be available in spring 2012. 

ACCESSSION NUMBER 

ChlP-seq data for the temporal characterisation of RNA 
Polymerase II binding on the F9 model system (5) has 
been deposited in GEO under accession number 
GSE30539. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Figures 1-7, Supplementary File 1. 
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